fix(backend): bound first-file cache wait in /v1/sync/audio urls endpoint (#7325)#7815
fix(backend): bound first-file cache wait in /v1/sync/audio urls endpoint (#7325)#7815kodjima33 wants to merge 3 commits into
Conversation
Greptile SummaryThis PR fixes a production hang on
Confidence Score: 4/5The fix correctly bounds the first-file wait to 15 s and the executor choice avoids deadlock. Safe to merge, with two minor gaps worth addressing before heavy production use. The core logic is sound: sync_executor is the right pool (its chunk-download children run on storage_executor), the 15s budget is well under the 120s middleware timeout, and the pending fallback is already handled by clients. Two issues reduce confidence slightly: the except FuturesTimeoutError block does not catch concurrent.futures.CancelledError, so a queued-but-cancelled future during shutdown produces a 500 instead of a graceful pending response; and the structural test only checks != storage_executor (negative), so any other wrong executor choice would go undetected. backend/routers/sync.py (exception handling around lines 297-303) and backend/tests/unit/test_sync_urls_bounded_wait.py (pool assertion at lines 64-66) Important Files Changed
Reviews (1): Last reviewed commit: "test(backend): register test_sync_urls_b..." | Re-trigger Greptile |
| assert ( | ||
| m.group(1) != 'storage_executor' | ||
| ), "coordinator must not share storage_executor with its chunk-download children" |
There was a problem hiding this comment.
The assertion only guards the negative case. If a future refactor changes the pool to
postprocess_executor, db_executor, or any other executor, this test still passes — including pools whose workers could deadlock or whose purpose clashes with audio-merge work. Asserting positively on sync_executor catches any such drift.
| assert ( | |
| m.group(1) != 'storage_executor' | |
| ), "coordinator must not share storage_executor with its chunk-download children" | |
| assert ( | |
| m.group(1) == 'sync_executor' | |
| ), "coordinator must run on sync_executor (not storage_executor, to avoid deadlock with its children)" |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| try: | ||
| first_file_future.result(timeout=FIRST_FILE_CACHE_WAIT_SECONDS) | ||
| except FuturesTimeoutError: | ||
| logger.warning( | ||
| f"sync_urls first-file cache exceeded {FIRST_FILE_CACHE_WAIT_SECONDS}s, returning pending " | ||
| f"uid={uid} convo={conversation_id} file={audio_file_id}" | ||
| ) |
There was a problem hiding this comment.
CancelledError not caught alongside FuturesTimeoutError
future.result() can also raise concurrent.futures.CancelledError — a separate exception class from FuturesTimeoutError — when the future is cancelled before it starts executing. This happens when sync_executor is at capacity and shutdown_executors(cancel_futures=True) fires during a graceful shutdown: the queued task is cancelled and result() raises rather than blocking. The exception propagates unhandled through the endpoint and becomes a 500 instead of the graceful pending response that would be correct here. Catching (FuturesTimeoutError, CancelledError) in the same handler would close the gap.
Bug
#7325 — audio download endpoints hang with 0 bytes (90–300s, then timeout). Confirmed live in prod LB logs:
GET /v1/sync/audio/{id}/urlsrepeatedly took 60–115s on 2026-06-10 alone; anything slower hits the 120sTimeoutMiddleware/ client timeouts and surfaces as the reported hang.Root cause
get_audio_signed_urls_endpointmerges the first uncached audio file synchronously in the request thread ("cache synchronously for immediate playback",routers/sync.py). For large recordings (hundreds of chunks) or under GCS pushback,download_audio_chunks_and_mergetakes minutes, so the request blocks with zero bytes sent.(The 429s the reporter saw are produced at the edge — those requests never reach backend pods, no app-level rate limit exists on this path — so the limiter half of the issue is infra config, out of scope here.)
Fix
Bound the wait: submit
_precache_audio_filetosync_executor(notstorage_executor— its chunk-download children run there; deadlock rule 3) and wait at mostFIRST_FILE_CACHE_WAIT_SECONDS = 15s. Healthy merges (0.1–5s per prod logs) behave exactly as before. On timeout the merge keeps running and lands in the GCS cache; the file is returned aspendingwith a warning log.Clients already handle
pending: the Flutter player falls back to the direct stream URL, and the web detail panel precaches + refetches. A fastpendingbeats a 60–115s block or an opaque 504.Tests
tests/unit/test_sync_urls_bounded_wait.py(registered intest.sh): structural guards (boundedresult(timeout=), no inline merge, coordinator not onstorage_executor, budget < middleware timeout) + logic tests of the wait/timeout primitive (timed-out merge still completes in background). All 6 verified locally via stdlib harness (pytest unavailable locally; CI runs the suite). py_compile,lint_async_blockers, black clean. UNVERIFIED at runtime against a live large-conversation merge.🤖 automated by hourly watchdog; opened for review, not merged.
Fixes #7325